{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Export sparse feature embedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在我们了解TensorNet保存的`sparse_table`目录中的数据内容及格式，以方便导出成字典供在线使用。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "MODEL_DIR = '/tmp/wide-deep-test/model'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/tmp/wide-deep-test/model/sparse_table\r\n",
      "├── 0\r\n",
      "│   └── rank_0\r\n",
      "│       ├── sparse_block_0.gz\r\n",
      "│       ├── sparse_block_1.gz\r\n",
      "│       ├── sparse_block_2.gz\r\n",
      "│       ├── sparse_block_3.gz\r\n",
      "│       ├── sparse_block_4.gz\r\n",
      "│       ├── sparse_block_5.gz\r\n",
      "│       ├── sparse_block_6.gz\r\n",
      "│       └── sparse_block_7.gz\r\n",
      "└── 1\r\n",
      "    └── rank_0\r\n",
      "        ├── sparse_block_0.gz\r\n",
      "        ├── sparse_block_1.gz\r\n",
      "        ├── sparse_block_2.gz\r\n",
      "        ├── sparse_block_3.gz\r\n",
      "        ├── sparse_block_4.gz\r\n",
      "        ├── sparse_block_5.gz\r\n",
      "        ├── sparse_block_6.gz\r\n",
      "        └── sparse_block_7.gz\r\n",
      "\r\n",
      "4 directories, 16 files\r\n"
     ]
    }
   ],
   "source": [
    "! tree $MODEL_DIR/sparse_table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`sparse_table`目录下会有多个目录，目录的个数由`tn.layers.EmbeddingFeatures`调用的次数确定，在我们tutorial中的例子中`wide`部分调用了一次，`deep`中调用了一次，那么便有两个目录。目录的名称依照调用的先后顺序从0开始命名，这个例子里面名称为`0`的目录下保存的是`wide`部分的embedding数据，为`1`的目录下保存的是`deep`部分的embedding数据。子目录中分别保存了每个节点的数据，以`rank_*`命名，在我们这个例子中使用单节点运行，所以只有`rank_0`。在每个节点的数据中分别会有8个block，每个block中的数据格式一致。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## sparse embedding数据格式"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "sparse block中的数据全部按`tab`键分隔，可以通过下面命令查看"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "opt_name:AdaGrad\r\n",
      "dim:8\r\n",
      "63927\t8\t0.0262204\t-0.0414651\t0.0461724\t0.0260017\t0.0613893\t-0.0325357\t0.0551388\t-0.00449165\t0.1\t1\t0.98\r\n",
      "61514\t8\t0.0209959\t-0.0770077\t-0.0248773\t0.016569\t0.0071595\t0.0478604\t0.0274112\t0.0725264\t0.1\t1\t0.98\r\n",
      "56580\t8\t0.00379409\t-0.0978684\t0.0398026\t-0.0278145\t-0.00481733\t-0.00540131\t-0.0336508\t0.0101625\t0.1\t1\t0.98\r\n",
      "51391\t8\t0.0342308\t-0.00472191\t-0.0216889\t0.0170641\t0.00393812\t-0.007634\t0.0107123\t0.0233057\t0.1\t1\t0.98\r\n",
      "41190\t8\t-0.0501618\t-0.0142409\t-0.0427884\t-0.064903\t0.0422692\t-0.0217611\t0.0552286\t0.0355111\t0.1\t1\t0.98\r\n",
      "35619\t8\t0.0202833\t-0.00314469\t-0.00274868\t-0.0165426\t0.00438455\t-0.0344267\t0.0173564\t0.0341289\t0.1\t1\t0.98\r\n",
      "31504\t8\t0.0344835\t-0.00100818\t0.0224287\t-0.0199555\t-0.0218565\t-0.0594322\t-0.0253813\t0.0232026\t0.1\t1\t0.98\r\n",
      "25596\t8\t-0.0139298\t-0.0488882\t0.0384313\t0.0378851\t0.00378205\t0.0485842\t-0.080289\t-0.0162278\t0.1\t1\t0.98\r\n",
      "\r\n",
      "gzip: stdout: Broken pipe\r\n",
      "cat: write error: Broken pipe\r\n"
     ]
    }
   ],
   "source": [
    "!cat $MODEL_DIR/sparse_table/1/rank_0/sparse_block_0.gz | gunzip | head -10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "文件的头两行中写了优化器的名称及用户设置的embedding的维度，之后的每一行都是具体的数据了。每一行的格式如下："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| sign    | dimension | embedding1 | embedding2 | ... | embeddingN | 优化器超参1| ...      | 优化器超参N | version | show_count |\n",
    "|---------|-----------|------------|------------|-----|------------|------------|----------|-------------| --------| -----------|\n",
    "| 63927  | 8         | 0.0262204  | -0.0414651 | ... |-0.00449165 | 0.1        | ...      |  0          | 1       | 0.98       | \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "每个优化器上所带的超参可能会不同，具体的可以参考`core/ps/optimizer/ada_grad_kernel.cc`和`core/ps/optimizer/adam_kernel.cc`中的实现，我们在线预估真正关注的只是前面的embedding数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 制作在线字典"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "上面的sparse table的格式比较明确，那么使用mapreduce或者spark很容易截取出我们需要的数据，以`sign`为key，embedding数据为value保存成字典就可以了。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在实际中，sparse embedding的数据会非常大，很可能会接近TB，我们内部是使用一个分布式哈希表将这个数据存储在多台机器上，然后在线使用rpc获取具体的embedding数据。如果没有精力实现分布式哈希表，则可以参考数据的最后一列`show_count`，将出现次数较少的特征砍掉以缩减字典的大小。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
